class: center, middle, inverse, title-slide

# Lecture 8
## Multiple Groups
### Psych 10 C
### University of California, Irvine
### 04/15/2022

---

## Comparisons between two groups

- Let's look at another example of a comparison between two populations.

--

- First we need a research problem or question.

--

- We are interested in studying the levels of anxiety of first year students at a university in two different cohorts: the first started in 2018 and the second in 2019.

--

- The university makes all students take a survey during the first week, which includes a scale designed to measure anxiety on a range from 0 to 20.

--

- We have been granted access to the data of 30 students from each cohort to analyze whether there are any differences between their levels of anxiety.

--

- Is this a paired samples (within subjects) design or an independent samples (between subjects) design?

---

## First year anxiety

- Before we get the results, we want to formalize our models.

--

- **Null model:** There is no difference in anxiety levels between students in the 2018 cohort and students in the 2019 cohort. In other words, the anxiety level of each student is an independent sample from the distribution:

`$$y_{ij} \sim \text{Normal}(\mu,\sigma_0^2)$$`

for `\(i=1,\dots,30\)` students and `\(j = 1, 2\)`, where 1 denotes the 2018 cohort and 2 denotes the 2019 cohort.

--

- **Effects model:** The anxiety levels of students in the 2018 cohort are different from the levels of the 2019 cohort. In other words, the anxiety level of a student in group `\(j=1,2\)` (where 1 denotes the 2018 cohort and 2 denotes the 2019 cohort) is an independent sample from the distribution:

`$$y_{ij} \sim \text{Normal}(\mu_j, \sigma_e^2)$$`

for `\(i = 1,\dots,30\)` students.

---

## Data

- Before we do any analysis, we can look at our data:
---

## Visualizing data

- Now we can look at the distribution of anxiety scores by cohort using a histogram:

--

.pull-left[
```r
# Overlapping histograms of anxiety scores, one per cohort
ggplot(data = anxiety) +
  aes(x = anxiety) +
  aes(fill = cohort, color = cohort) +
  geom_histogram(position = "identity",
                 binwidth = 1, alpha = 0.3) +
  theme_classic() +
  xlab("Anxiety score") +
  ylab("Frequency") +
  guides(fill = guide_legend("Cohort"),
         color = "none") +
  theme(axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20))
```
]

.pull-right[
<img src="data:image/png;base64,#lec-8_files/figure-html/hist-anxiety-out-1.png" style="display: block; margin: auto;" />
]

---

## Inferring parameter values from observations

- The histogram doesn't show any systematic differences. However, to reach a conclusion we need to compare our two models:

--

**Null model:**

.pull-left[
```r
# One prediction for everyone: the overall mean
anxiety <- anxiety %>%
  mutate("null_pred" = round(mean(anxiety), 3),
         "null_error" = round((anxiety - null_pred)^2, 3))
```
]

.pull-right[
Prediction = 9.067

SSE = 273.704

Mean SE = 4.562
]

**Effects model:**

.pull-left[
```r
# One prediction per cohort: the cohort mean
group_means <- anxiety %>%
  group_by(cohort) %>%
  summarise("pred" = round(mean(anxiety), 3))

anxiety <- anxiety %>%
  mutate("eff_pred" = ifelse(test = cohort == "2018",
                             yes = group_means$pred[1],
                             no = group_means$pred[2]),
         "eff_error" = round((anxiety - eff_pred)^2, 3))
```
]

.pull-right[
Prediction:

- 2018 = 9.033
- 2019 = 9.1

SSE = 273.664

Mean SE = 4.561
]

---

## Model evaluation

- The proportion of error accounted for by the Effects model was:

  - `\(R^2 = 1.5 \times 10^{-4}\)`

--

- In other words, the model that assumes that anxiety levels are different between the two cohorts explains 0.015% of the variability in anxiety levels.

--

- Compared to previous examples, the effects model accounts for a very small percentage of the error.
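--

- As a quick check, the `\(R^2\)` value can be reproduced from the two SSE values reported for the Null and Effects models. This is only a minimal sketch: it assumes `\(R^2 = (SSE_0 - SSE_e)/SSE_0\)`, the definition used later in this lecture, and the names `sse_0`, `sse_e` and `r2` are illustrative. Because the reported SSE values are rounded, the result agrees with the slide only approximately.

```r
# SSE values as reported on the slides (rounded to 3 decimals)
sse_0 <- 273.704  # Null model
sse_e <- 273.664  # Effects model

# Proportion of error accounted for by the Effects model
# (assumed definition: (SSE_0 - SSE_e) / SSE_0)
r2 <- (sse_0 - sse_e) / sse_0
r2
```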
---

## BIC

- The BIC associated with the Null model was `\(BIC_0 =\)` 95.157

--

- The BIC associated with the Effects model was `\(BIC_e =\)` 99.242

--

- Given that the BIC value for the Null model is lower than the BIC of the Effects model, we can conclude that:

  - The evidence suggests that there are no differences in anxiety levels between first year students of the 2018 and 2019 cohorts.

---

class: inverse, middle, center

# Multiple Groups

---

## Comparing multiple groups

- In the previous example we saw that the evidence suggested there was no difference in anxiety levels between the two cohorts. However, what would happen if we took the 2020 cohort into account?

--

- When our independent variable takes more than two categorical values (e.g. multiple cohorts, multiple tests, etc.) we have to make some changes to our models.

--

- The **Null model** will remain the same, and again it formalizes the assumption that there are no differences between the groups.

--

- However, the **Effects model** now has to take into account the fact that there are more than 2 groups.

---

class: inverse, center, middle

# Effects Model

---

## Effects model

- Our new effects model will now formalize the assumption that **at least one** of the groups is different.

--

- The problem is that we don't know exactly which one of the groups is different from the others. But for now, this is the best that we can do with the two models that we have.

--

**Effects model:** Let `\(y_{ij}\)` be the anxiety level of the *i-th* student from the *j-th* cohort, with `\(i = 1, \dots, 30\)` and `\(j = 1, 2, 3\)`; the value 1 represents the 2018 cohort, 2 represents the 2019 cohort and 3 represents the 2020 cohort. Then, each observation is assumed to be an independent sample from one of 3 distributions:

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

---

## Multiple groups: Predictions

- Now we have 4 parameters in total for the model: 3 expectations `\((\mu_j)\)` and 1 error variance `\(\sigma_e^2\)`.
--

- Something that doesn't change is our best guess for `\(\mu_j\)`. The estimator for the parameter `\(\mu_j\)` will be the average of each group, except that now we have 3 groups.

--

- In our example the prediction for each cohort *j* can be written as:

`$$\hat{\mu}_j = \frac{1}{n_j}\sum_{i=1}^{30}y_{ij}$$`

--

- Here `\(n_j\)` represents the total number of students in each cohort (in our example this number is the same for all cohorts, 30 students).

--

- In other words, our prediction `\(\hat{\mu}_j\)` for the anxiety level of students that belong to the *j-th* cohort will be the average anxiety level of the *j-th* cohort.

---

## Multiple groups: Mean Squared Error

- Our best "guess" or estimator for the error of the Effects model will be similar to the one we had for the two groups case.

--

- Again, the only difference is that this time we have more than two groups or predictions.

--

- In our example about anxiety levels in 3 cohorts of first year students we have that:

`$$\hat{\sigma}_e^2=\frac{1}{n}\sum_{j=1}^{3}\sum_{i=1}^{30}\left(y_{ij}-\hat{\mu}_j\right)^2$$`

--

- This time `\(n\)` represents the total number of students in our data. Given that we have 3 cohorts, each with 30 students, the total is `\(n = 90\)`.

---

## Multiple groups: SSE and `\(R^2\)`

- The Sum of Squared Errors (SSE) is again equal to the sum of the squared errors of our predictions:

`$$SSE_e = \sum_{j=1}^{3}\sum_{i=1}^{30}\left(y_{ij}-\hat{\mu}_j\right)^2$$`

--

- Once again, using the SSE of both models (`\(SSE_0\)` for the Null and `\(SSE_e\)` for the Effects model) we can calculate the proportion of variability accounted for by the effects model:

`$$R^2 = \frac{SSE_0 - SSE_e}{SSE_0}$$`

---

## Multiple groups: BIC

- Finally, this time we have 3 parameters (`\(\mu_1\)`, `\(\mu_2\)` and `\(\mu_3\)`) associated with the predictions of our model, so this new Effects model will be more complicated than the previous one.
  The BIC associated with the effects model will now be equal to:

`$$BIC_e = n \ln\left(\hat{\sigma}_e^2\right) + k \ln\left(n\right)$$`

--

- In our example with 3 groups, `\(k=3\)`.

---

class: inverse, center, middle

# Example:
## Three cohorts

---

## Multiple groups: Problem

- The university has given us access to a sample of 90 students' anxiety scores. The first 30 belong to first year students in the 2018 cohort, the next 30 belong to first year students in the 2019 cohort, and the remaining 30 belong to first year students in the 2020 cohort.

--

- We want to know whether the anxiety levels of students differ by cohort.

--

- Our Null model assumes that there are no differences in anxiety levels between the 3 cohorts.

`$$y_{ij}\sim\text{Normal}(\mu, \sigma_0^2)$$`

--

- Our Effects model assumes that at least one of the cohorts is different, although we can't say which one.

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

---

## Multiple groups: Inference

**Null model**

.pull-left[
```r
n_total <- nrow(anxiety)

# One prediction for everyone: the overall mean
anxiety <- anxiety %>%
  mutate("null_pred" = mean(anxiety),
         "null_error" = (anxiety - null_pred)^2)

# Sum of squared errors and mean squared error
sse_0 <- sum(anxiety$null_error)
mse_0 <- sse_0 / n_total
```
]

.pull-right[
Prediction: 10.08

SSE: 598.46

Mean SE: 6.65
]

---

## Multiple groups: Inference

**Effects model**

.pull-left[
```r
# One prediction per cohort: the cohort mean
mean_groups <- anxiety %>%
  group_by(cohort) %>%
  summarise("pred" = mean(anxiety))

anxiety <- anxiety %>%
  mutate("eff_pred" = case_when(cohort == "2018" ~ mean_groups$pred[1],
                                cohort == "2019" ~ mean_groups$pred[2],
                                cohort == "2020" ~ mean_groups$pred[3]),
         "eff_error" = (anxiety - eff_pred)^2)

# Sum of squared errors and mean squared error
sse_e <- sum(anxiety$eff_error)
mse_e <- sse_e / n_total
```
]

.pull-right[
Prediction:

- 2018: 9.17
- 2019: 8.93
- 2020: 12.13

SSE: 407.5

Mean SE: 4.53
]

---

## Multiple groups: Model evaluation

- The proportion of variance accounted for by the effects model was `\(R^2\)` = 0.319.
  In other words, the effects model accounts for 31.9% of the total variation in the anxiety scores of the students in the 3 cohorts.

--

- The BIC of the Null model was equal to 175.01, while the BIC of the effects model was 149.42.

--

- According to the BIC values, we should select the effects model. This means that at least one of the 3 groups is different from the others.

--

- Notice that this is not the conclusion that we were looking for. We want to know how the anxiety levels of first year students differ by cohort. However, from our Effects model we can only conclude that at least one of the groups is different.

---

## Multiple groups: other models

- Remember that in our first analysis we saw that there was no difference between the 2018 and 2019 cohorts. Nevertheless, once we add the students from the 2020 cohort something changes, and now the best model is the one that assumes that there are differences.

- Next week we will see that we need to evaluate other models in order to answer our original question about the differences in anxiety scores between first year students of different cohorts.
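
---

## Multiple groups: checking the model evaluation

- The `\(R^2\)` and BIC values for the three-cohort example can be reproduced from the reported SSE values. This is a minimal sketch, not the lecture's own code: the helper `bic()` is hypothetical, and it assumes `\(k = 1\)` for the Null model (a single mean), which the slides do not state explicitly. The Effects model uses `\(k = 3\)` as given earlier.

```r
# Hypothetical helper: BIC = n * ln(mse) + k * ln(n), with mse = SSE / n
bic <- function(sse, n, k) {
  n * log(sse / n) + k * log(n)
}

# Values reported on the inference slides (rounded)
n_total <- 90
sse_0 <- 598.46  # Null model
sse_e <- 407.5   # Effects model

# Proportion of variability accounted for by the Effects model
r2 <- (sse_0 - sse_e) / sse_0

# Assumption: the Null model has k = 1 (one mean); Effects has k = 3
bic_0 <- bic(sse_0, n_total, k = 1)
bic_e <- bic(sse_e, n_total, k = 3)

round(c(r2 = r2, bic_0 = bic_0, bic_e = bic_e), 3)
```

- Since the inputs are rounded, the results should be close to, but not necessarily identical to, the values reported on the model evaluation slide.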